Principal Component Analysis

Lecture seven

Miguel Rodo (presenting edited slides from Francesca Little)

Context and general approach

Context

  • Multiple variables (possibly many, possibly from more than one dataset) that we wish to relate to one another and/or to an outcome variable
  • Willing to accept a loss of information in exchange for a more summarised representation, and/or
  • Believe that a simpler structure underlies the observed data

Basis of approach

We summarise the data through linear combinations of the original variables:

\[ x_1u_1 + x_2u_2 + \dots + x_pu_p \]

Why linear combinations?

Linear combinations are the simplest way to combine variables and are backed by strong theoretical results; this in turn yields robustness, speed, and flexibility.

Techniques and specific context

  • Principal component analysis: \(\mathbf{Y}\)
  • Factor analysis: \(\mathbf{Y} \sim ?\)
  • Canonical correlation analysis: \(\mathbf{Y} \sim \mathbf{X}\)
  • Correspondence analysis: \(\mathbf{Y}\) (count data)
  • Discrimination and classification: \(\mathbf{Y} \sim \mathbf{G}\) and \(\mathbf{G} \sim \mathbf{Y}\)

Simple example

Principal components

  • We restrict our attention to creating composite (new) variables that are linear combinations of the original variables.
  • These are termed principal components.
  • Assuming that our data \(\mathbf{X} \in \mathcal{R}^p\) are distributed \((\mathbf{0}, \mathbf{\Sigma})\), this implies that a linear combination \(\mathbf{a}'\mathbf{X}\) is distributed \((0, \mathbf{a}'\mathbf{\Sigma}\mathbf{a})\).
  • Since the variance can be made arbitrarily large by increasing the length of \(\mathbf{a}\), we place the restriction that \(||\mathbf{a}||=1\).
  • We wish to find \(p\) principal components, where
    • the first principal component has maximal variance, and
    • all later principal components have maximal variance, subject to the restriction that they are orthogonal to all previous principal components.
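As a quick numerical illustration (a NumPy sketch with simulated data; the variable names and the simulated covariance structure are illustrative, not from the slides), the principal components can be obtained from the eigendecomposition of the sample covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
# simulated data with correlated columns (illustrative only)
X = rng.standard_normal((200, 3)) @ np.array([[2.0, 0.5, 0.0],
                                              [0.0, 1.0, 0.3],
                                              [0.0, 0.0, 0.5]])
X = X - X.mean(axis=0)                 # centre the data

S = np.cov(X, rowvar=False)            # sample covariance matrix (p x p)
eigvals, eigvecs = np.linalg.eigh(S)   # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# each coefficient vector has unit length, as required
lengths = np.linalg.norm(eigvecs, axis=0)   # ≈ [1, 1, 1]

Y = X @ eigvecs   # scores: column i is the i-th principal component
```

Note that the sample variances of the columns of `Y` come out in decreasing order, matching the defining property of the components.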

Preliminary result

Theorem 1 (Maximisation of quadratic forms for points on the unit sphere) Let \(\mathbf{B}:p\times p\) be a positive semi-definite matrix with the \(i\)-th largest eigenvalue \(\lambda_i\) and associated eigenvector \(\mathbf{e}_i\). Then we have that

\[ \max_{\mathbf{x}\neq \mathbf{0}} \frac{\mathbf{x}'\mathbf{B}\mathbf{x}}{\mathbf{x}'\mathbf{x}}=\lambda_1 \; (\mathrm{attained}\; \mathrm{when} \; \mathbf{x}=\mathbf{e}_1) \]

and that

\[ \max_{\mathbf{x}\neq \mathbf{0},\mathbf{x}\perp \mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_{i-1}} \frac{\mathbf{x}'\mathbf{B}\mathbf{x}}{\mathbf{x}'\mathbf{x}}=\lambda_i \; (\mathrm{attained}\; \mathrm{when} \; \mathbf{x}=\mathbf{e}_{i}) \]

for \(i\in\{2,3, \ldots, p\}\).

Proof of Theorem 1

Proof. Let \(\mathbf{P}:p\times p\) be the orthogonal matrix whose \(i\)-th column is the \(i\)-th eigenvector and \(\Lambda\) be the diagonal matrix with ordered eigenvalues along the diagonal. Let \(\mathbf{B}^{1/2}=\mathbf{P}\mathbf{\Lambda}^{1/2}\mathbf{P}'\) and \(\mathbf{y}=\mathbf{P}'\mathbf{x}\).

First, we show that the quadratic form can never be larger than \(\lambda_1\):

\[ \begin{align*} \frac{\mathbf{x}'\mathbf{B}\mathbf{x}}{\mathbf{x}'\mathbf{x}} &= \frac{\mathbf{x}'\mathbf{B}^{1/2}\mathbf{B}^{1/2}\mathbf{x}}{\mathbf{x}'\mathbf{P}\mathbf{P}'\mathbf{x}} = \frac{\mathbf{x}'\mathbf{P}\mathbf{\Lambda}^{1/2}\mathbf{P}'\mathbf{P}\mathbf{\Lambda}^{1/2}\mathbf{P}'\mathbf{x}}{\mathbf{y}'\mathbf{y}} = \frac{\mathbf{y}'\mathbf{\Lambda}\mathbf{y}}{\mathbf{y}'\mathbf{y}} \\ &= \frac{\sum_{i=1}^p\lambda_iy_i^2}{\sum_{i=1}^py_i^2} \leq \lambda_1 \frac{\sum_{i=1}^py_i^2}{\sum_{i=1}^py_i^2} = \lambda_1 \end{align*} \]

Proof of Theorem 1 (cont.)

Now we show that this is actually attained for \(\mathbf{x}=\mathbf{e}_1\). Since eigenvectors are by convention length 1, we consider \(\mathbf{e}_1'\mathbf{B}\mathbf{e}_1\).

First, let \(\mathbf{c}_i\) be the unit vector with a 1 in the \(i\)-th position (and 0’s everywhere else).

Expanding \(\mathbf{B}\) by the eigen decomposition, we have that

\[ \mathbf{e}_1'\mathbf{B}\mathbf{e}_1=\mathbf{e}_1'\mathbf{P}\mathbf{\Lambda}\mathbf{P}'\mathbf{e}_1. \]

Since \(\mathbf{B}\) is symmetric, the eigenvectors can be chosen to be orthogonal. We do so, which implies that \(\mathbf{e}_i'\mathbf{P}=\mathbf{c}_i'\). Consequently,

\[ \mathbf{e}_i'\mathbf{P}\mathbf{\Lambda}\mathbf{P}'\mathbf{e}_i = \mathbf{c}_i'\mathbf{\Lambda}\mathbf{c}_i=\lambda_i, \]

and so \(\mathbf{e}_1'\mathbf{P}\mathbf{\Lambda}\mathbf{P}'\mathbf{e}_1=\lambda_1\).

Proof of Theorem 1 (cont.)

Now, we consider the case where \(\mathbf{x}\) is orthogonal to \(\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_{i-1}\).

Each component of the vector \(\mathbf{y}\) is the dot product of \(\mathbf{x}\) with an eigenvector \(\mathbf{e}_j\). Since we choose \(\mathbf{x}\) orthogonal to the first \(i-1\) eigenvectors, the first \(i-1\) entries of \(\mathbf{y}\) are zero.

Returning to considering the quadratic form, we have that

\[ \begin{align} \frac{\mathbf{x}'\mathbf{B}\mathbf{x}}{\mathbf{x}'\mathbf{x}} &= \frac{\sum_{j=1}^p\lambda_jy_j^2}{\sum_{j=1}^py_j^2} \\ &= \frac{\sum_{j=i}^p\lambda_jy_j^2}{\sum_{j=i}^py_j^2} \leq \lambda_i \frac{\sum_{j=i}^py_j^2}{\sum_{j=i}^py_j^2} = \lambda_i \end{align} \]

Using the same argument as before, we can show that this is actually attained for \(\mathbf{x}=\mathbf{e}_i\).
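Theorem 1 is easy to check numerically. The sketch below (NumPy, with a randomly generated positive semi-definite matrix; not part of the original slides) verifies that random directions never beat \(\lambda_1\) and that the bound is attained at \(\mathbf{e}_1\):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
B = A @ A.T                                          # positive semi-definite

eigvals, eigvecs = np.linalg.eigh(B)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # descending order

def rayleigh(x):
    """The quadratic form x'Bx / x'x from Theorem 1."""
    return (x @ B @ x) / (x @ x)

# random directions never exceed lambda_1 ...
samples = rng.standard_normal((10_000, 4))
assert max(rayleigh(x) for x in samples) <= eigvals[0] + 1e-9

# ... and the maximum is attained at the first eigenvector
print(np.isclose(rayleigh(eigvecs[:, 0]), eigvals[0]))  # True
```

The same check works for the constrained case: restricting to the span of \(\mathbf{e}_i, \ldots, \mathbf{e}_p\) caps the quotient at \(\lambda_i\).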

Connection with principal components

  • Theorem 1 has already done the heavy lifting, as we can simply set \(\mathbf{B}=\mathbf{\Sigma}\).
  • All that’s left to do is place Theorem 1 in a probabilistic context, and relate it to the variance and covariance of linear combinations of a random vector.

Selecting principal components

Theorem 2 (Identifying successive variance-maximising and orthogonal linear combinations) Let \(\mathbf{\Sigma}:p\times p\) be the variance-covariance matrix with the \(i\)-th largest eigenvalue \(\lambda_i\) and associated eigenvector \(\mathbf{e}_i\). Then the \(i\)-th principal component (defined as before) is given by

\[ Y_i = \mathbf{e}_i'\mathbf{X}, \]

implying that

\[\begin{align*} \mathrm{Var}[Y_i] &= \lambda_i \;\forall \; i \in \{1,2, \ldots, p\}, \; \mathrm{and} \\ \mathrm{Cov}[Y_i, Y_j] &= 0 \; \mathrm{for} \; i\neq j. \end{align*}\]

Proof of Theorem 2

Proof. From Theorem 1, we know that

\[ \max_{\mathbf{x}} \frac{\mathbf{x}'\mathbf{\Sigma}\mathbf{x}}{\mathbf{x}'\mathbf{x}} = \lambda_1, \]

which we achieve by setting \(\mathbf{x}=\mathbf{e}_1\).

Since \(||\mathbf{e}_1||=1\), \(\frac{\mathbf{e}_1'\mathbf{\Sigma}\mathbf{e}_1}{\mathbf{e}_1'\mathbf{e}_1}=\mathbf{e}_1'\mathbf{\Sigma}\mathbf{e}_1\), which is equal to \(\mathrm{Var}[\mathbf{a}'\mathbf{X}]=\mathrm{Var}[Y_1]\) if we set \(\mathbf{a}=\mathbf{e}_1\). This implies that \(\mathbf{a}=\mathbf{e}_1\) maximises the variance, which is \(\lambda_1\).

Analogous reasoning shows that \(\mathrm{Var}[\mathbf{a}_i'\mathbf{X}]\) has a maximum value \(\lambda_i\) for \(i\in\{2,3,\ldots,p\}\), attained when we set \(\mathbf{a}_i=\mathbf{e}_i\), under the restriction that \(\mathbf{a}_i\) be orthogonal to \(\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_{i-1}\).

For \(i\neq j\), \(\mathrm{Cov}[\mathbf{e}_i'\mathbf{X},\mathbf{e}_j'\mathbf{X}] = \mathbf{e}_i'\mathbf{\Sigma}\mathbf{e}_j = 0\).
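Theorem 2 can also be checked directly: for any valid covariance matrix, the covariance of the scores, \(\mathbf{P}'\mathbf{\Sigma}\mathbf{P}\), should be diagonal with the eigenvalues along the diagonal. A NumPy sketch (the particular \(\mathbf{\Sigma}\) below is an invented example, not from the slides):

```python
import numpy as np

# an illustrative positive-definite variance-covariance matrix
Sigma = np.array([[4.0, 1.0, 0.5],
                  [1.0, 2.0, 0.3],
                  [0.5, 0.3, 1.0]])

eigvals, eigvecs = np.linalg.eigh(Sigma)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # descending order

# covariance of the scores Y = P'X is P' Sigma P ...
cov_Y = eigvecs.T @ Sigma @ eigvecs

# ... which is diagonal, with Var[Y_i] = lambda_i and Cov[Y_i, Y_j] = 0
print(np.round(cov_Y, 10))   # ≈ diag(lambda_1, lambda_2, lambda_3)
```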

A note on terminology

  • Principal components (the \(Y_i\)’s) may also be referred to as “scores”.
  • The coefficient vectors \(\mathbf{a}_i\) may also be referred to as “loadings” (thinking algebraically) or as “principal axes”/“principal directions” (thinking geometrically).

Theorem 3: Equality of total variances

Theorem 4: Correlation with original variables

Example 1

Implication if \(\mathbf{X} \sim \mathrm{MVN}\)

Pre-standardising the variances

  • Before applying PCA, one can standardise the variables to have unit variance.
  • This may be appropriate when:
    • the variables have different units (e.g. weight in kg vs annual income in rands), or
    • the variables have different scales (e.g. some genes are extremely rarely expressed whereas others are abundantly expressed), or
    • we wish to eliminate any effect of differences in variance on downstream analyses, e.g. when applying penalised regression.
  • Similar comments apply to logging the data, or applying other transformations.
  • All results go through in exactly the same way, with adjustments made in interpreting exactly what variables the principal components capture.
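Standardising each variable to unit variance is equivalent to running PCA on the correlation matrix rather than the covariance matrix. A minimal NumPy sketch of that equivalence (simulated data; illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
# two variables on wildly different scales (illustrative)
X = rng.standard_normal((500, 2)) * np.array([1.0, 100.0])

# centre and scale each column to unit variance
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# the covariance matrix of the standardised data ...
S_z = np.cov(Z, rowvar=False)
# ... equals the correlation matrix of the original data
R = np.corrcoef(X, rowvar=False)

print(np.allclose(S_z, R))  # True
```

Without standardisation, the second variable's huge variance would dominate the first principal component; on the correlation scale, both variables contribute on an equal footing.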

Example of pre-standardising variances

Population versus sample principal components

Approximating a matrix \(\mathbf{X}\)

Theorem 5a: Using the SVD to approximate a matrix

Theorem 5b: PCA leads to same approximation as the SVD

From Theorem 5a, we know that the best rank-\(k\) approximation of \(\mathbf{X}\) is given by \(\hat{\mathbf{X}}=\mathbf{U}\mathbf{D}\mathbf{J}_k\mathbf{V}'\).

We wish to show that the matrix approximation \(\tilde{\mathbf{X}}\) constructed from the first \(k\) principal components is the same. Letting \(\mathbf{Y}:n\times p\) be the matrix of scores and \(\mathbf{P}:p\times p\) be the matrix of loadings, we have that

\[ \mathbf{Y} =\mathbf{X}\mathbf{P} \implies \mathbf{Y}\mathbf{P}^{-1} =\mathbf{X}. \]

We can make use of only the first \(k\) principal components by inserting the matrix \(\mathbf{J}_k\):

\[ \mathbf{Y}\mathbf{J}_k\mathbf{P}^{-1} =\tilde{\mathbf{X}}. \]

By definition, \(\mathbf{P}=\mathbf{V}\), as \(\mathbf{P}\) and \(\mathbf{V}\) both have the ordered eigenvectors of \(\mathbf{X}'\mathbf{X}\) along their columns. Since the eigenvectors are orthonormal, \(\mathbf{P}^{-1}=\mathbf{P}'=\mathbf{V}'\). We can therefore write

\[ \mathbf{Y}\mathbf{J}_k\mathbf{V}' =\tilde{\mathbf{X}}. \]

Theorem 5b: PCA leads to same approximation as the SVD (cont.)

It remains to show that \(\mathbf{Y}=\mathbf{U}\mathbf{D}\). Since \[ \mathbf{Y}=\mathbf{X}\mathbf{P}=\mathbf{X}\mathbf{V}, \]

we have that

\[ \mathbf{Y}=\mathbf{U}\mathbf{D}\mathbf{V}'\mathbf{V}=\mathbf{U}\mathbf{D} \]

by orthogonality of \(\mathbf{V}\).
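The equivalence of the truncated SVD and the rank-\(k\) PCA reconstruction can be verified numerically. A NumPy sketch (simulated centred data; not part of the original slides):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((50, 4))
X = X - X.mean(axis=0)                  # centred data matrix

# SVD route: keep the first k singular triplets
U, d, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
X_svd = U[:, :k] @ np.diag(d[:k]) @ Vt[:k, :]

# PCA route: scores Y = XV, keep the first k, map back with V'
Y = X @ Vt.T
X_pca = Y[:, :k] @ Vt[:k, :]

print(np.allclose(X_svd, X_pca))  # True
```

This also exhibits \(\mathbf{Y}=\mathbf{U}\mathbf{D}\) concretely: each column of `Y` is the corresponding column of `U` scaled by its singular value.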

Number of components to retain

Final example

Final example (cont.)

Final example (cont.)

Biplots and the PCA biplot I

Biplots and the PCA biplot II

Biplots and the PCA biplot III

Content covered

Class exercise